AITopics | document-term matrix


Uncovering Key Trends in Industry 5.0 through Advanced AI Techniques

Fitsilis, Panos, Tsoutsa, Paraskevi, Damasiotis, Vyron, Kyriatzis, Vasileios

arXiv.org Artificial Intelligence, Oct-22-2024

This article analyzes around 200 online articles to identify trends within Industry 5.0 using artificial intelligence techniques. Specifically, it applies algorithms such as LDA, BERTopic, LSA, and K-means, in various configurations, to extract and compare the central themes present in the literature. The results reveal a convergence around a core set of themes while also highlighting that Industry 5.0 spans a wide range of topics. The study concludes that Industry 5.0, as an evolution of Industry 4.0, is a broad concept that lacks a clear definition, making it difficult to focus on and apply effectively. Therefore, for Industry 5.0 to be useful, it needs to be refined and more clearly defined. Furthermore, the findings demonstrate that well-known AI techniques can be effectively utilized for trend identification, particularly when the available literature is extensive and the subject matter lacks precise boundaries. This study showcases the potential of AI in extracting meaningful insights from large and diverse datasets, even in cases where the thematic structure of the domain is not clearly delineated.

artificial intelligence, machine learning, natural language, (19 more...)

2410.16748

Country:

Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.34)

Industry: Energy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

A Novel Two-Step Method for Cross Language Representation Learning

Neural Information Processing Systems, Mar-13-2024, 14:42:00 GMT

Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small.

classification, representation, unlabeled parallel data unlabeled, (14 more...)

Country: North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.90)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.90)

How to Use SVD and NMF in Python

#artificialintelligence, Mar-10-2023, 17:45:38 GMT

In the context of Natural Language Processing (NLP), topic modeling is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents. Topic modeling answers the question: "Given a text corpus of many documents, can we find the abstract topics that the text is talking about?" By the end of this tutorial, you'll be able to build your own topic models to find topics in any piece of text. Let's start by understanding what topic modeling is. Suppose you're given a large text corpus containing several documents.

document-term matrix, text corpus, word cloud, (12 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.61)
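
Topic-modeling methods such as SVD and NMF, named in the tutorial title above, operate on a document-term matrix: one row per document, one column per vocabulary term, each cell a count. The following is a minimal sketch of building such a matrix with the Python standard library only. The function name `document_term_matrix`, the whitespace tokenizer, and the three sample sentences are illustrative assumptions, not taken from the tutorial.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Naive tokenization (an assumption for this sketch): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a Counter returns 0 for terms it has not seen
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical three-document corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
vocabulary, dtm = document_term_matrix(docs)

for doc, row in zip(docs, dtm):
    print(row, doc)
```

Real pipelines usually add stop-word removal, stemming, and TF-IDF weighting before factorizing the matrix; those steps are omitted here to keep the structure of the matrix visible.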

A comparison of latent semantic analysis and correspondence analysis of document-term matrices

Qi, Qianqian, Hessen, David J., Deoskar, Tejaswini, van der Heijden, Peter G. M.

arXiv.org Artificial Intelligence, Nov-25-2022

Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, i.e. sums of row elements and column elements, arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.

artificial intelligence, natural language, semantic analysis and correspondence analysis, (2 more...)

doi: 10.1017/S1351324923000244

2108.06197

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.89)
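
The abstract above says that CA eliminates the effects of margins (document lengths and term frequencies). One way to see this is to look at the matrix that standard correspondence analysis hands to the SVD: the standardized residuals from the independence model, rather than the raw counts that plain LSA decomposes. The sketch below computes that matrix in pure Python; the function name and the toy matrices are assumptions for illustration, and the code assumes no document or term has a zero total.

```python
from math import sqrt


def ca_standardized_residuals(counts):
    """Matrix that correspondence analysis decomposes with an SVD.

    Assumes every row and every column of `counts` has a positive sum.
    """
    total = sum(sum(row) for row in counts)
    # Correspondence matrix P: counts rescaled so that all cells sum to 1.
    P = [[x / total for x in row] for row in counts]
    r = [sum(row) for row in P]        # row masses (relative document lengths)
    c = [sum(col) for col in zip(*P)]  # column masses (relative term frequencies)
    # S[i][j] = (P[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j])
    return [
        [(P[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


# Two documents with the same term mix; the second is simply half as long.
same_profile = [[2, 4], [1, 2]]
S_same = ca_standardized_residuals(same_profile)

# Two documents of equal length with different term mixes.
different_profile = [[3, 1], [1, 3]]
S_diff = ca_standardized_residuals(different_profile)

print(S_same)
print(S_diff)
```

In the first case every residual is zero up to rounding, so length alone contributes nothing to the CA solution; in the second the residuals are nonzero and carry the document-term association.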

Learning Neural Networks on SVD Boosted Latent Spaces for Semantic Classification

Sidheekh, Sahil

arXiv.org Machine Learning, Jan-3-2021

The availability of large amounts of data and compelling computational power has made deep learning models very popular for text classification and sentiment analysis. Deep neural networks have achieved competitive performance on the above tasks when trained on naive text representations such as word count, term frequency, and binary matrix embeddings. However, many of the above representations result in the input space having a dimension of the order of the vocabulary size, which is enormous. This leads to a blow-up in the number of parameters to be learned, and the computational cost becomes infeasible when scaling to domains that require retaining a colossal vocabulary. This work proposes using singular value decomposition to transform the high-dimensional input space to a lower-dimensional latent space. We show that neural networks trained on this lower-dimensional space are not only able to retain performance while enjoying a significant reduction in computational complexity but, in many situations, also outperform the classical neural networks trained on the native input space.

latent space, neural network, representation, (15 more...)

2101.00563

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
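
To illustrate the idea in the abstract above of mapping a vocabulary-sized input to a smaller latent space with SVD, here is a minimal pure-Python sketch that finds only the leading singular direction by power iteration and projects each document onto it. This is not the paper's pipeline: the function name, the fixed iteration count, the single retained component, and the toy matrix are all assumptions, and a real system would use a library truncated SVD with many components.

```python
from math import sqrt


def leading_latent_direction(A, iterations=200):
    """Leading singular value, right singular vector, and 1-D document coordinates.

    Assumes A is a non-empty, non-zero matrix with non-negative entries, so the
    uniform start vector is not orthogonal to the leading singular vector.
    """
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / sqrt(n_cols)] * n_cols  # uniform unit start vector
    for _ in range(iterations):
        # One power-iteration step on A^T A: compute w = A^T (A v), then normalize.
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every document (row) onto the learned direction.
    coords = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = sqrt(sum(x * x for x in coords))  # equals ||A v|| for unit v
    return sigma, v, coords


# Hypothetical 3-document, 3-term count matrix.
dtm = [
    [2, 1, 0],
    [4, 2, 0],
    [0, 0, 3],
]
sigma, v, coords = leading_latent_direction(dtm)
print(sigma, coords)
```

Each document is now described by one number instead of three; with k retained components the input layer of a downstream network would have k units rather than one per vocabulary term.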

How Stuff Works: A Comprehensive Topic Modelling Guide with NMF, LSA, PLSA, LDA & lda2vec (Part-1)

#artificialintelligence, Sep-9-2019, 02:46:03 GMT

This article is a comprehensive overview of Topic Modeling and its associated techniques. This is the first part of the article and will cover NMF, LSA and PLSA only. LDA and lda2vec will be covered in the next part. In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning -- from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics.

artificial intelligence, machine learning, natural language, (14 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Going deep in clustering high-dimensional data: deep mixtures of unigrams for uncovering topics in textual data

Anderlucci, Laura, Viroli, Cinzia

arXiv.org Machine Learning, Feb-18-2019

Deep learning methods can be defined as a multi-layer stack of algorithms or modules able to gradually learn a huge number of parameters in an architecture composed of multiple nonlinear transformations (LeCun et al., 2015). Typically, and for historical reasons, a structure for deep learning is identified with advanced neural networks: deep feed-forward, recurrent, auto-encoder, and convolutional neural networks are very effective and widely used deep learning algorithms (Schmidhuber, 2015). They have proved particularly successful in supervised classification problems arising in several fields, such as image and speech recognition, gene expression data, and topic classification. When the aim is uncovering unknown classes from an unsupervised classification perspective, important methods of deep learning have been developed along the lines of mixture modeling, because of their ability to decompose a heterogeneous collection of units into a finite number of subgroups with homogeneous structures (Fraley and Raftery, 2002; McLachlan and Peel, 2000). In this direction, van den Oord and Schrauwen (2014) proposed Multilayer Gaussian Mixture Models for modeling natural images; Tang et al. (2012) defined deep mixtures of factor analyzers with a greedy layer-wise learning algorithm able to learn one layer at a time. Viroli and McLachlan (2019) developed a general framework for Deep Gaussian mixture models that generalizes and encompasses the previous strategies and several flexible model-based clustering methods such as mixtures of mixture models (Li, 2005), mixtures of factor analyzers (McLachlan et al., 2003), mixtures of factor analyzers with common factor loadings (Baek et al., 2010), heteroscedastic factor mixture analysis (Montanari and Viroli, 2010) and mixtures of factor mixture analyzers introduced by Viroli (2010).
A general 'take-home message' from the existing deep clustering strategies is that deep methods, compared with shallow ones, appear to be very efficient and powerful tools, especially for complex high-dimensional data; by contrast, for simple and small data structures, a deep learning strategy cannot improve on the performance of simpler, conventional methods or, to put it better, it is like using a 'sledgehammer to crack a nut'. The motivating problem behind this work derives from ticket data (i.e.

adjusted rand index, algorithm, high-dimensional data, (15 more...)

1902.06615

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

How to easily do Topic Modeling with LSA, PLSA, LDA & lda2Vec

#artificialintelligence, Jun-6-2018, 13:26:00 GMT

This article is a comprehensive overview of Topic Modeling and its associated techniques. In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning -- from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling. In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.

artificial intelligence, machine learning, natural language, (17 more...)

Genre: Overview (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)

A Comparison of Machine Learning Algorithms for the Surveillance of Autism Spectrum Disorder

Lee, Scott H, Maenner, Matthew J, Heilig, Charles M

arXiv.org Machine Learning, Apr-17-2018

The Centers for Disease Control and Prevention (CDC) coordinates a labor-intensive process to measure the prevalence of autism spectrum disorder (ASD) among children in the United States. Random forest methods have shown promise in speeding up this process, but they lag behind human classification accuracy by about 5 percent. We explore whether newer document classification algorithms can close this gap. We applied 6 supervised learning algorithms to predict whether children meet the case definition for ASD based solely on the words in their evaluations. We compared the algorithms' performance across 10 random train-test splits of the data, and then we combined our top 3 classifiers to estimate the Bayes error rate in the data. Across the 10 train-test cycles, the random forest, neural network, and support vector machine with Naive Bayes features (NB-SVM) each achieved slightly more than 86.5 percent mean accuracy. The Bayes error rate is estimated at approximately 12 percent, meaning that the model error for even the simplest of our algorithms, the random forest, is below 2 percent. NB-SVM produced significantly more false positives than false negatives. The random forest performed as well as newer models like the NB-SVM and the neural network. NB-SVM may not be a good candidate for use in a fully automated surveillance workflow due to increased false positives. More sophisticated algorithms, like hierarchical convolutional neural networks, would not perform substantially better due to characteristics of the data. Deep learning models performed similarly to traditional machine learning methods at predicting the clinician-assigned case status for CDC's autism surveillance system. While deep learning methods had limited benefit in this task, they may have applications in other surveillance systems.

artificial intelligence, machine learning, neural network, (17 more...)

1804.06223

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Autism (1.00)
Health & Medicine > Public Health (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)
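
The "below 2 percent" figure in the abstract above follows from simple arithmetic if model error is read as total error minus the estimated Bayes (irreducible) error. That reading is an assumption of this sketch, as are the variable names; the two input numbers are the ones quoted in the abstract.

```python
# Figures quoted in the abstract, in percent.
mean_accuracy = 86.5  # "slightly more than 86.5 percent", so treated as a lower bound
bayes_error = 12.0    # estimated irreducible error in the data

# Total error is whatever the classifier gets wrong.
total_error = 100.0 - mean_accuracy

# Assumed reading: model error is the part of total error above the Bayes error.
model_error = total_error - bayes_error

print(f"total error <= {total_error:.1f}%, model error <= {model_error:.1f}%")
```

Because the reported accuracy is slightly above 86.5 percent, the actual model error is slightly below the bound computed here.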

Text Processing in R

@machinelearnbot, Mar-11-2018, 22:45:44 GMT

This tutorial goes over some basic concepts and commands for text processing in R. R is not the only way to process text, nor is it always the best way. Python is the de facto programming language for processing text, with a lot of built-in functionality that makes it easy to use and pretty fast, as well as a number of very mature and full-featured packages such as NLTK and textblob. Basic shell scripting can also be many orders of magnitude faster for processing extremely large text corpora -- for a classic reference see Unix for Poets. Yet there are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. Furthermore, there is a lot of very active development going on in the R text analysis community right now (see especially the quanteda package).

artificial intelligence, natural language, text processing, (12 more...)

Country: North America > United States (0.29)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
